Statistics in Medicine
○ Wiley
Preprints posted in the last 90 days, ranked by how well they match Statistics in Medicine's content profile, based on 34 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.
Liu, Y.; Harris, R. E.; Clauw, D.; Bayman, E.; Leroux, A.; Lindquist, M. A.
Show abstract
Chronic pain is a widespread public health issue that imposes substantial health, emotional, and economic burdens on individuals and communities. Because pain is subjective and lacks objective biomarkers, it is typically measured using patient-reported scores, often on a numerical scale from zero to ten. Increasingly, pain studies use ecological momentary assessment, with multiple daily assessments over days and across study phases (e.g., a series of baseline and post-intervention assessments). These data frequently show many ratings at the extremes (i.e., at minimum or maximum pain scores), commonly referred to as zero- and one-inflation in the statistical literature, along with considerable within-person variability both within and across days. These phenomena present challenges for statistical analyses, as they violate assumptions of most commonly used statistical techniques (e.g., the normality assumption of linear mixed models). We propose a Bayesian beta-binomial mixed-effects model for modeling potential zero- or one-inflated pain scores while accounting for variability using random effects on the mean and variance parameters across subjects. A simulation study demonstrates that the method accurately estimates model parameters across realistic sample sizes, time points, and zero- and one-inflation levels. An application to data from two longitudinal pain studies demonstrates that the model fits the data better and, when correctly specified, yields accurate uncertainty intervals for longitudinal changes in pain compared to existing models, especially for zero- and one-inflated outcomes. Additionally, the model directly estimates the probability of clinically meaningful pain events. The proposed method provides a powerful statistical framework for studying the patient-reported pain trajectories.
ORWA, F. O.; Mutai, C.; Nizeyimana, I.; Mwangi, A.
Show abstract
When randomized controlled trials are impractical, interrupted time series designs offer a rigorous quasi-experimental approach to assess population level policies. Indeed, in the context of quasi-experimental designs (QEDs), the Interrupted Time Series (ITS) method is commonly thought of as the most robust. But interrupted time series designs are susceptible to serial correlation and confounding by time-varying factors associated with both the intervention and the outcome, which may result in biased inference. Thus, we provide a simulation-based contrast of controlled interrupted time series (CITS) and multivariable regression (multivariable negative binomial regression) for estimation of policy effects in count time series data. These approaches are widely used in policy evaluations, yet their comparative performance in typical population health settings has rarely been examined directly. We tested both approaches within a variety of data generating situations, differing in the series length, intervention effect size, and magnitude of lag-1 autocorrelation. Bias, standard error calibration, confidence interval coverage, mean squared error, and statistical power were assessed for performance. Both methods gave unbiased estimates for moderate and large intervention effects, although bias was more pronounced for small effects, particularly in short series. Although the point estimate performance was similar, inferential properties varied significantly. CITS always had smaller mean squared error, better consistency between model based and empirical standard errors, and confidence interval coverage near the 95% nominal levels over weak to moderate autocorrelation. By contrast, multivariable regression was more sensitive to serial dependence, leading to underestimated standard errors and undercoverage, especially at moderate to high autocorrelation, regardless of Newey-West adjustments. These findings show the benefits of using a concurrent control series and the importance of structurally accounting for serial correlation when studying population level policies with time series data.
Hripcsak, G.; Anand, T.; Chen, H. Y.; Zhang, L.; Chen, Y.; Suchard, M. A.; Ryan, P. B.; Schuemie, M. J.
Show abstract
Propensity score adjustment is commonly used in observational research to address confounding. Controversy persists about how to select covariates as possible confounders to generate the propensity model. A desire to include all possible confounders is offset by a concern that more covariates will augment bias or increase variance. Much of concern is over instruments, which are variables that affect the treatment but not the outcome. Adjusting for an instrument has been shown to increase bias due to unadjusted confounding and to increase the variance of the effect estimate. Large-scale propensity score (LSPS) adjustment includes most available pre-treatment covariates in its propensity model. It addresses instruments with a pair of diagnostics, ceasing the analysis if any covariate exceeds a correlation coefficient of 0.5 with the treatment and checking for an aggregation of instruments with equipoise reported as a preference score. Our simulation assesses the impact of adjusting for instruments in the context of LSPSs diagnostics. In our simulation, even when the variance of the treatment contributed by the adjusted instrument(s) exceeds an unadjusted confounder by over twenty-fold, when the correlation between the instrument(s) and the treatment was less than 0.5 and the equipoise was greater than 0.5, the additional shift in the effect estimate due to adjusting for the instrument(s) was less than the shift due to confounding by itself. Therefore, we find in this simulation that adjusting for instruments contributed a minor amount of bias to the effect estimate. This simulation aligns well with a previous assessment of the impact of adjusting for instruments and with separate empirical evidence that adjusting for many covariates surpasses attempts to identify a limited set of confounders.
Chen, Y.; Gui, T.; Huang, Z.; Quach, N.; Tu, S.; Liu, J.; Garrett, T. J.; Starkweather, A. R.; Lyon, D. E.; Shepherd, B. E.; Tu, X. M.; Lin, T.
Show abstract
SO_SCPLOWUMMARYC_SCPLOWChemotherapy in breast cancer (BC) can substantially affect mental wellness. Advances in metabolomics enable comprehensive profiling of metabolic changes over time during and after treatment, offering insights into biological mechanisms linking chemotherapy to mental health outcomes. To study the association between metabolite profiles and mental wellness, correlation-based analyses are particularly useful. Spearmans rho is a widely used correlation measure and popular alternative to Pearsons correlation, since it also applies to non-linear association between variables. However, existing methods are not designed for longitudinal data and do not allow for covariate adjustments. In this paper, we propose a novel regression-based framework grounded in a class of semiparametric models, the functional response models, to extend this popular correlation measure to longitudinal settings with missing data under the missing at random assumption. This framework facilitates inferences about temporal changes in correlations over time and association of explanatory variables for such changes. We use simulation studies to evaluate performance of the approach with moderate sample sizes. We apply the approach to a one-year longitudinal substudy of the EPIGEN study to examine the longitudinal association between metabolite profiles and mental wellness in BC patients undergoing chemotherapy. The identified metabolites may serve as candidates for future in-depth bioinformatics analyses and translational investigations.
Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.
Show abstract
Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramer-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.
Fayette, L.; Brendel, K.; Mentre, F.
Show abstract
Joint modelling of longitudinal data using non-linear mixed effects models and time-to-event outcomes provides a suitable framework to account for informative censoring when estimating biomarker dynamics and quantifying event risk using covariates and longitudinal trajectories. Their usefulness in clinical research depends on data collection design, particularly to precisely estimate the association (link) parameter between longitudinal and survival processes. However, optimal design strategies have so far been addressed separately for longitudinal and survival endpoints and remain unexplored for joint models. We propose two Fisher Information Matrix (FIM) computation methods for joint models, relying on Monte-Carlo integration over observations combined with either Markov Chains Monte-Carlo or Adaptive Gaussian Quadrature to integrate random effects. Their accuracy is assessed against clinical trial simulations in an oncological example based on the HORIZON III study with a tumour-growth-survival model including discrete and continuous covariates. We apply these methods to quantify the impact of follow-up duration, sampling richness, sample size, and covariate distribution on parameter uncertainty and test power. In our example, longitudinal-parameter uncertainty is barely affected by follow-up duration or sampling richness, whereas survival-parameter uncertainty decreases substantially from 1-year to 2-year follow-up. The number of subjects needed (NSN) to achieve <15\% uncertainty on the link parameter is comparable for a 2-year rich design and a 3-year sparse design. Optimal covariate distributions are stable across designs and systematically improve test power, outperforming longer and richer but non-optimised designs. These FIM-based methods accurately predict uncertainty and test powers, enabling design evaluation and NSN computation for joint-model-based clinical studies.
Webster, A. J.; Drakesmith, C. W.; Perera-Salazar, R.; Steinsaltz, D.; COMPUTE team,
Show abstract
Biomarker measurements can assist with disease diagnosis and the assessment of disease risks, with the most recent measurements usually used by disease-risk models. However, a growing number of studies suggest that in addition to a biomarkers value, its inherent variability, estimated from several measurements over many days or years in an individual, can convey independent prognostic information about disease risks. Variance estimates require an individuals biomarker data to have been measured a sufficient number of times, ideally across a long time period, and are usually only available in a hospital setting or clinical trial. Furthermore, a single biomarker measurement will involve a combination of measurement-error, natural short-term variation over a daily time-period, variation over time periods of weeks and months, and slower age-dependent changes over several years. This paper develops a statistical method that accounts for these latter concerns, and applies it to Clinical Practice Research Datalink (CPRD) data collected by UK General Practitioners. It studies the associations between cardiovascular health outcomes and the within-person variances of eight routinely measured biomarkers. This involved Sequential Monte Carlo modeling to convert an individuals biomarker measurements (collected over months or years), into estimates for the biomarkers mean, linear age-dependent slope, within-person variance, and a variance due to variation on a daily time period or measurement errors. The result is a proof-of-principle that UK primary care Electronic Health Records (from CPRD) can be effectively used for this purpose. After adjusting for mean biomarker values, clear associations were found between mortality or cardiovascular disease risks and within-person variances for 6 of 8 biomarkers.
Owusu-Boaitey, N.; Meyer, M. J.; Herrera-Esposito, D.; Bottcher, L.; Lukz, M.; Cook, S.; Stoto, M. A.; Kraemer, J. D.
Show abstract
Seroprevalence surveys reveal the extent of humoral immunity against pathogens such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and under some circumstances represent cumulative incidence of prior infection. However, antibody waning - or seroreversion - biases these estimates by reducing assay sensitivity in a time-varying manner. Because assay sensitivity decays over time, naively using serosurveys can substantially bias estimates of SARS-CoV-2 cumulative incidence and fatality rates. The Bayesian assay-specific, time-varying sensitivity adjustment developed in this paper can reliably correct for this bias and account for the delay between infection and serosurvey. In seroprevalence studies conducted in the United States in 2020, adjusting for time-varying sensitivity increased cumulative incidence by up to 1.4-fold, with an adjustment of 1.08 for a national study. Our estimates contrast with a previously published 2-fold adjustment that did not account for assay design. This suggests that previous analyses overestimated cumulative incidence by applying seroreversion corrections that did not account for assay-specific effects, or underestimated cumulative incidence by not applying seroreversion corrections. These biases imply fatality rate underestimation and overestimation, respectively. Our model provides a framework for design-specific time-varying sensitivity corrections in seroprevalence surveys for other pathogens.
Goncalves, B. P.; Franco, E. L.
Show abstract
Timeliness of therapy initiation is a fundamental determinant of outcomes for many medical conditions, most importantly, cancer. Yet, existing inefficiencies in healthcare systems mean that delays between diagnosis and treatment frequently adversely affect the clinical outcome for cancer patients. Although estimates of effects of lag time to therapy would be informative to policymakers considering resource allocation to minimize delays in oncology, causal methods are seldom explicitly discussed in epidemiologic analyses of these lag times. Here, we propose causal estimands for such studies, and outline the protocol of a target trial that could be emulated with observational data on lag times. To illustrate the application of this approach, we simulate studies of lag time to treatment under two scenarios: one in which indication bias (Waiting Time Paradox) is present and another in which it is absent. Although our discussion focuses on oncologic outcomes, components of the proposed target trial could be adapted to study delays for other medical conditions. We believe that the clarity with which causal questions are posed under the target trial emulation framework would lead to improved quantification of the effects of lag times in oncology, and hence to better informed policy decisions.
Wei, M.; Yadlapati, L.; Peng, Q.
Show abstract
BackgroundThe Adolescent Brain Cognitive Development (ABCD) Study(R) offers rich longitudinal data on environmental, genetic, and other factors related to substance use initiation. Classical marginal structural models (MSMs) require selecting covariates for propensity models, which is challenging in the presence of hundreds of correlated predictors. MethodsWe analyzed longitudinal panel data from 11,868 ABCD participants, where each individual contributed repeated observations over time. Interval-level binary outcomes were defined for initiation of alcohol, nicotine, cannabis, and any substance, restricting analyses to participants at risk prior to initiation. All predictors were constructed as lagged variables to preserve temporal ordering. We implemented a two-stage machine learning-based causal framework. First, we performed graph discovery using a Granger-inspired lagged predictive modeling approach, applying elastic-net logistic regression to identify predictive relationships between lagged environmental variables and future initiation outcomes. Robust candidate edges were selected using subject-level bootstrap stability selection. Second, we estimated adjusted effect sizes for stable edges using double machine learning (DML)-style partialling-out with cross-fitting. For each candidate predictor, the treatment was defined as the lagged variable of interest and adjusted for high-dimensional lagged covariates. Cross-fitting with group-based splitting accounted for within-subject dependence, and nuisance functions were estimated using random forest models. Cluster-robust standard errors were used for inference. ResultsWe identified a set of stable predictors across multiple domains, including sleep patterns, family environment, peer relationships, behavioral traits, and genetic risk. Many predictors were shared across substance outcomes, while some were outcome-specific. Estimated effect sizes were modest, typically ranging from -0.01 to 0.02 per standard deviation increase in the predictor. Both risk-increasing and protective associations were observed. Risk factors included sleep disturbance and behavioral risk indicators, while protective factors included parental monitoring and structured environments. ConclusionsThis study provides a practical framework for analyzing high-dimensional longitudinal data and identifying time-varying predictors of substance use initiation. The approach combines machine learning for variable selection with causal inference methods for effect estimation. The results highlight both shared and substance-specific risk factors and identify modifiable targets, such as family environment and sleep, that may inform prevention strategies.
Chen, J.; Lambe, T.; Kamau, E.; Donnelly, C.; Lambert, B.; Bajaj, S.
Show abstract
AO_SCPLOWBSTRACTC_SCPLOWSerological surveys measure the presence of antibodies in a population to infer past exposure to an infectious pathogen. If study participants ages are known, serocatalytic models can be used to retrace the historical transmission strength of a pathogen within that population, quantified by the force of infection (FOI). These models rely on age information as a key variable since infection risks are interpreted in relation to how long individuals have been at risk. However, due to data constraints, participants ages may be provided only within "age bins". A common approach is then to assign individuals ages to be midpoints of their respective age bins, ignoring uncertainty in this quantity. In this study, we quantify the bias introduced by this midpoint approach and develop a Bayesian framework that explicitly accounts for uncertainty in age. By comparing inference under constant, age-dependent, and time-dependent FOI scenarios, we show that incorporating uncertainty in age in serocatalytic models yields more reliable FOI estimates without sacrificing computational complexity. These improvements support the interpretation of serological data and inform public health decisions, such as estimating disease burden and identifying targeted vaccination groups.
Obeng-Gyasi, E.
Show abstract
Background: Mixture epidemiology deploys sophisticated estimators, Bayesian kernel machine regression with causal mediation analysis (BKMR-CMA), quantile G-computation (QGC), and parametric G-computation, alongside conventional regression. Comparative evaluations have assumed additive, non-mediated data-generating processes, leaving conditions under which estimator choice determines causal validity uncharacterized. Methods: We developed a simulation framework using military-relevant exposure distributions (metals, per- and polyfluoroalkyl substances [PFAS], polychlorinated biphenyls [PCBs]) and allostatic load (AL) across three deployment tiers, with parameters drawn from military occupational health and contamination literature. Four data-generating processes were specified as directed acyclic graphs: direct effects with confounding (M1), full mediation through AL (M2), synergistic AL-exposure interaction (M3), and collider structure (M4). We evaluated ordinary least squares (OLS), QGC, G-computation, and BKMR-CMA on bias, root mean squared error, and 95% confidence interval coverage across 500 Monte Carlo replications at n = 500 and n = 1,000. Results: No estimator dominated across all mechanisms. Under M1, OLS and G-computation produced near-identical modest positive bias; BKMR-CMA achieved lower root mean squared error through kernel shrinkage. Under M2, BKMR-CMA exhibited severe positive bias for AL (mean bias = +0.579 SD units; coverage = 32.8%). Under M3, BKMR-CMA was the only estimator achieving nominal 95% coverage for AL (95.2%), while regression-based approaches fell to 83.6%. Under M4, G-computation produced persistent bias and near-zero coverage for lead, reflecting structural non-identification. Conclusions: Estimator validity is fundamentally mechanism-dependent. Researchers should base estimator choice on explicit causal assumptions about whether AL functions as confounder, mediator, moderator, or collider, particularly in military and occupational cohorts. We provide a mechanism-to-estimator mapping for applied researchers.
Wei, M.; Zhang, H.; Peng, Q.
Show abstract
BackgroundEarly initiation of substance use is linked to later adverse outcomes, and risk factors come from multiple domains and are shared across substances. In our previous work, traditional time-to-event Cox models identified individual risk factors, but these models are not designed to jointly model multiple outcomes or capture complex non-linear relationships. Multi-task learning (MTL) can leverage shared structure across related outcomes to improve prediction and distinguish common versus substance-specific predictors. However, most MTL studies rely on baseline features and focus on single outcomes, which limits their ability to capture shared risk and temporal changes. Substance use initiation is a time-dependent process that unfolds during development and reflects changing exposures over time. Baseline-only models cannot capture these changes or represent risk dynamics. Discrete-time modeling provides a practical approach by estimating interval-level initiation risk and combining it into cumulative risk at the subject level. By integrating multi-task learning with dynamic modeling, it is possible to share information across outcomes while capturing how risk evolves over time, which may improve prediction performance. MethodsUsing the Adolescent Brain Cognitive Development (ABCD) Study(R) (release 5.1), we developed two complementary multi-task learning (MTL) frameworks to predict initiation of alcohol, nicotine, cannabis, and any substance use. A baseline MTL model predicted fixedhorizon (48-month) initiation using one record per participant, while a dynamic discrete-time MTL model incorporated longitudinal interval data to model time-varying risk. Both models used multi-domain environmental exposures, core covariates, and polygenic risk scores (PRS). Performance was evaluated on a held-out test set using AUROC, PR-AUC, and calibration metrics, and compared with single-task logistic regression (LR). Feature importance was assessed using permutation importance and compared with Cox proportional hazards models. ResultsMTL showed comparable or improved performance relative to LR, with larger gains for low-prevalence outcomes (cannabis and nicotine). Incorporating longitudinal information led to consistent improvements across all outcomes. Dynamic models increased AUROC by +0.044 to +0.062 for MTL and +0.050 to +0.084 for LR, indicating that temporal information was the primary driver of performance gains. Feature importance analyses showed modest overlap across methods, with higher agreement between dynamic MTL and Cox models than static MTL. A small set of features, including externalizing behavior, parental monitoring, and developmental factors, were consistently identified across all approaches. ConclusionsDynamic multi-task learning improves the prediction of substance use initiation by leveraging longitudinal structure and shared information across outcomes. While MTL provides additional gains, incorporating time-varying information is the dominant factor for improving performance. Combining baseline and dynamic frameworks offers a comprehensive strategy for identifying robust risk factors and modeling adolescent substance use initiation.
Jackson, K. C.; Carilli, M. T.; Pachter, L.
Show abstract
Contrastive principal component analysis (PCA) methods are effective approaches to dimensionality reduction where variance of a target dataset is maximized while variance of a background dataset is minimized. We previously described how contrastive PCA problems can be written as solutions to generalized eigenvalue problems that maximize particular instantiations of the Rayleigh quotient. Here, we discuss two extensions of contrastive PCA: we use kernel weighting from spatial PCA (k-{rho}PCA) to contrast spatial and non-spatial axes of variation, and separately solve the Rayleigh quotient in the space of basis function coefficients (f-{rho}PCA) to find modes of variation in functional data. Together, these extensions expand the scope of contrastive PCA while unifying disparate fields of spatial and functional methods within a single conceptual and mathematical framework. We showcase the utility of these extensions with several examples drawn from genomics, analyzing gene expression in cancer and immune response to vaccination.
Li, Y.; Cabral, H.; Tripodis, Y.; Ma, J.; Levy, D.; Joehanes, R.; Liu, C.; Lee, J.
Show abstract
Mediation analysis quantifies how an exposure affects an outcome through an intermediate variable. We extend mediation analysis to capture the cumulative effects of longitudinal predictors on longitudinal outcomes. Our proposed model examines how mediators transmit the effects of the current and previous exposure on the current outcome. We construct a least-squared estimator for cumulative indirect effect (CIE) and used three approaches (exact form, delta method, and bootstrap procedure) to estimate its standard error (SE). The estimator of CIE is unbiased with no unmeasured confounding and independent model errors between mediator model and outcome model at all time points, as shown in statistical inference and in simulations. While three SE estimates are numerically similar, bootstrap procedure is recommended due to its simplicity in implementation. We apply this method to Framingham Heart Study offspring cohort to assess if DNA methylation mediates the association of alcohol consumption with systolic blood pressure over two time points. We identify two CpGs (cg05130679 and cg05465916) as mediators and construct a composite DNA methylation score from 11 CpGs, which mediates for 39% of the cumulative effect. In conclusion, we propose an unbiased estimator for CIE. Future studies will investigate the missingness in mediators and outcomes.
Kleper, S. L.; Melamed, R. D.
Show abstract
Machine learning models for causal inference aim to adjust for confounding factors that are associated with both an exposure and an outcome, creating a spurious biased association. But, these methods are rarely empirically evaluated to assess their success in mitigating such bias. Recent advances in knowledge representation, including both foundation models and knowledge graphs, could enrich these models, but rigorous evaluations are needed in order to assess their potential. Here, we ask whether enriching existing causal inference models with knowledge representations from foundation models can improve confounding control. Rather than using semi-simulated data to address this question, we focus on examples of real confounding: we emulate target randomized active comparator trials that are subject to confounding by indication. Our results can guide researchers aiming to develop or apply methods for discovering causal effects from observational data.
Shukla, N.; Bartington, S. E.; Hansell, A. L.; Lucas, T. C.
Show abstract
Background: In the absence of high-resolution response data, exposure-response modelling often relies on aggregated low-frequency exposure data, leading to loss of high-resolution information. Mixed Data Sampling (MIDAS) from econometrics offers an alternative but is limited due to its inability to make high-resolution predictions, inflexible likelihoods and penalised nonlinear functions, and limited visualization options. We propose a mixed-frequency Distributed Lag Non-linear Model (mf-DLNM) which can eliminate the need to aggregate exposure data in environmental epidemiology and provide high resolution predictions for time series studies. Methods: We evaluated the inference and predictive performance of the mf-DLNM. To evaluate its ability to estimate exposure-response relationships, we applied mf-DLNM and same-frequency (sf)-DLNM using data from the West Midlands, UK. Additionally, we compared the predictive performance of mf-DLNM with sf-DLNM and MIDAS across nine regions of England. As MIDAS cannot predict at the resolution of the predictor (daily), we compared the predictive performance of mf-DLNM and MIDAS at weekly resolution. To test the model's ability to predict high temporal resolution risk (daily), we compared sf-DLNM (with access to daily mortality counts) with mf-DLNM (with access only to weekly mortality counts). Results: In the West Midlands example, mf-DLNM performed comparably to sf-DLNM in estimating daily risk of temperature on respiratory mortality. Furthermore, mf-DLNM and MIDAS exhibited similar performance for weekly predictions. For high-resolution predictions, mf-DLNM and sf-DLNM showed nearly similar performance, despite mf-DLNM having access only to low-resolution response data. Conclusion: This mixed-frequency approach in environmental epidemiology overcomes the limitations of predicting health risks using aggregated exposure data and provides estimates of high-resolution outcomes in the absence of high-frequency health outcome datasets.
Lan, Y.; Wu, C.-Y.; Lin, H.-H.; Cohen, T.; Warren, J. L.
Show abstract
Pairwise analysis of genomic and spatial data offers opportunities to identify and estimate the associations between covariates and the transmission of pathogens between individuals. However, such pairwise analyses are computationally intensive, and may not be feasible to conduct given the high dyad count in even moderately sized datasets. Here we compare two approaches to increase the efficiency of pairwise analysis for large datasets. We quantify and compare the performance of divide-and-conquer Bayesian model fitting and pairwise case-control approaches for estimating associations between individual- and pair-level covariates and shared membership in a transmission cluster. We utilize a large dataset (n=4,154) of spatially-referenced, genomically-sequenced Mycobacterium tuberculosis isolates collected from a single city for this analysis. We find that the case-control approach produces unbiased estimates of effect sizes with expected credible interval coverage and is more robust than the divide-and-conquer method when effect sizes are large. Thus, we recommend using the case-control approach with at least three controls per case to downscale datasets for pairwise analysis when analysis of the entire dataset is not possible. This approach mitigates the computational challenges of pairwise Bayesian modeling on datasets that require significant computational resources while maintaining desired inferential properties. Author SummaryPairwise analyses of large datasets to study pathogen transmission are computationally demanding because they typically require simultaneous analysis of each possible pair of individuals in a dataset; as datasets become larger these analyses often are not feasible to conduct even with access to high-performance computing resources. In this work, we compare a case-control approach and divide-and-conquer approaches for more efficient pairwise analysis of large datasets. Using a large dataset of Mycobacterium tuberculosis isolates including genetic and spatial data, we investigate the performance of each method for estimating the associations between host covariates and genetic clustering of isolates. We find that the case-control approach is generally preferred over methods which first divide the data into subsets and then combine results. While additional extensions of these analyses are needed to test the generality of these findings to other data settings, this work provides a practical way forward for the pairwise analysis of large datasets to study pathogen transmission.
XIE, R.; Gascuel, O.; ZHUKOVA, A.
Show abstract
Phylodynamics bridges the gap between epidemiology and pathogen genetic data by estimating epidemiological parameters from time-scaled pathogen phylogenies. Multi-type birth-death (MTBD) models are phylodynamic analogies of compartmental models in classical epidemiology. They serve to infer the average number of secondary infections R and the infection duration d. Moreover, more complex MTBD models add extra parameters, such as the average length of the incubation period or the proportion of superspreaders in the infected population. However, these additional parameters come at an important computational cost: Apart from the simplest, BD, model, MTBD models do not have a closed-form solution and require numerical methods for their likelihood computation. This leads to increased computational times and potential numerical errors. Therefore, the BD model remains the favorite researchers choice for real dataset analyses, and is often applied even in cases where more complex epidemiological aspects are present. We investigated, using simulations, how model misspecification influences inference of R and d in the phylodynamic framework. We showed that the use of models not accounting for various epidemiological aspects leads to bias. In particular the simplest, BD, estimator tends to underestimate R in the presence of super-spreading or incubation, which might be dangerous from the public health prospective. However, deep-learning-based estimators for complex models, which account for multiple epidemiological factors, perform well both on the data where those factors are present and where they are absent. This advocates for the use of complex epidemiologically realistic estimators, whose design has recently become possible thanks to deep learning.
Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.
Show abstract
Alzheimers disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimers disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.